Feature selection methods and predictive models in CT lung cancer radiomics

Abstract Radiomics is a technique that extracts quantitative features from medical images using data‐characterization algorithms. Radiomic features can be used to identify tissue characteristics and radiologic phenotyping that is not observable by clinicians. A typical workflow for a radiomics study includes cohort selection, radiomic feature extraction, feature and predictive model selection, and model training and validation. While there has been increasing attention given to radiomic feature extraction, standardization, and reproducibility, currently, there is a lack of rigorous evaluation of feature selection methods and predictive models. Herein, we review the published radiomics investigations in CT lung cancer and provide an overview of the commonly used radiomic feature selection methods and predictive models. We also compare limitations of various methods in clinical applications and present sources of uncertainty associated with those methods. This review is expected to help raise awareness of the impact of radiomic feature and model selection methods on the integrity of radiomics studies.


INTRODUCTION
Radiomics is a technique that extracts quantitative features, termed radiomic features, from medical images using data-characterization algorithms. 1 Radiomic features can be used to identify tissue characteristics and radiologic phenotyping that is not observable by clinicians. Although morphological image features have been used in clinical practice for many decades, the concept of radiomics was first proposed in 2012. Since then, the number of radiomics studies has grown exponentially.
We performed a PubMed literature search on publications from 2012 to the end of 2021. The keyword "radiomics" returned 5579 results, adding "lung cancer" returned 911 results, and further adding "CT" returned 560 results. Figure 1 illustrates the annual number of publications returned by the literature search.
FIGURE 1 Annual number of publications returned by PubMed searches for "Radiomics" + "Lung Cancer" and "Radiomics" + "Lung Cancer" + "CT." The field of radiomics has grown steadily over the past decade. (Accessed on January 10, 2022)
[...] applications in lung cancer. [5-22] The review papers regarding image quality or feature reproducibility discussed neither feature selection methods nor predictive models. Review papers focusing on potential clinical applications generally mentioned basic model construction when discussing a study. Two of these review papers included a short summary of feature selection methods and predictive models based on a very limited number of sources. 7,11 To the best of our knowledge, there is no comprehensive review that examines feature selection methods and predictive models used in CT lung cancer radiomics. Feature selection methods and predictive models play a pivotal role in radiomics. Hence, we provide a review of this topic based on a thorough analysis of existing literature.
Figure 2 shows a typical radiomics study pipeline. To begin, a patient cohort is chosen based on criteria appropriate to the study parameters (e.g., type of disease, treatment method, etc.). A patient cohort can be sourced from institutional data, public databases (e.g., The Cancer Imaging Archive), or a combination of both. Radiomic features are extracted from the cohort image sets using feature extraction software; often hundreds, if not thousands, of features are extracted.
A feature selection process is then conducted to remove redundant and unstable features before the predictive model is trained. The datasets used to develop a predictive model include a training dataset and a validation dataset, both originating from the patient cohort, and a test dataset originating from an independent data source.

OVERVIEW
The independent test set is key to model development as it provides an unbiased evaluation of model performance.
The two main clinical applications of radiomics are classification and prognostics. Classification studies create a model that can stratify a cohort into distinct bins determined by the study. Typically, these studies are interested in determining malignancy, specific gene expression or phenotyping, or segmentation. Prognostic studies create models that score individual cases along a sliding scale. The outputs of prognostic studies are heavily associated with future clinical outcomes such as treatment response, disease spread, or survival rates. Figure 3 illustrates the annual number of publications for each application in this review.
It should be noted that we provide an overview of feature selection methods and predictive models and of possible trends over time. This review is not intended to compare those methods or to provide recommendations for future study construction.
FIGURE 3 Annual number of publications for classification and prognostic studies in CT lung cancer radiomics that were included in this review (series: Classification, Prognostics)

General features
The extracted radiomic features typically fall under four main categories: shape, first-order, second-order, and higher-order features. Shape features indicate morphological characteristics of the region of interest (ROI). First-order features are direct measurements of the voxel values, describing the distribution of intensities within the ROI. Second-order features, also referred to as texture features, describe how voxels within a given ROI relate to each other. Higher-order features can be generated by applying filters (e.g., wavelet) to the ROI/image before extracting features.
Each category of feature aside from higher-order features contains a number of feature families, as seen in Table 2. Each of these feature families contains individual features that operate on the ROI in a similar manner. The number of commonly used individual features is over 100, and some of these exhibit spatial variations. There may be thousands of extracted features when all variations are considered.
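To make the notion of first-order features concrete, the following is a minimal sketch, assuming a NumPy array stands in for the CT image and a binary mask marks the ROI. The function name `first_order_features` and the default bin count are illustrative assumptions, not taken from any of the reviewed studies or from a specific extraction package.

```python
import numpy as np

def first_order_features(image, mask, n_bins=32):
    """Compute a few first-order features from the voxels inside a binary ROI mask.

    `n_bins` (hypothetical default) sets the intensity discretization used for entropy.
    """
    voxels = image[np.asarray(mask, dtype=bool)].astype(float)
    mean = float(voxels.mean())
    median = float(np.median(voxels))
    std = float(voxels.std())
    # Skewness: third standardized moment (0 for a symmetric distribution).
    skewness = float(np.mean((voxels - mean) ** 3) / std ** 3) if std > 0 else 0.0
    # Shannon entropy (in bits) of the discretized intensity histogram.
    counts, _ = np.histogram(voxels, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    entropy = float(-np.sum(p * np.log2(p)))
    return {"mean": mean, "median": median, "skewness": skewness, "entropy": entropy}
```

Because the features depend only on the intensity distribution inside the mask, shuffling the voxel positions within the ROI leaves all four values unchanged; capturing spatial relationships is the role of the second-order (texture) features.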

Feature selection methods
The most common feature selection methods are listed in Table 3 along with a short explanation of how each one works. Radiomic studies employ either a single feature selection method or several. An alternative approach is to directly extract a smaller, predetermined list of features. Feature selection methods select the most reliable and relevant features for model training by removing redundant information. This process improves the robustness of the predictive model and helps reduce the risk of overfitting as well as calculation time. 152
Table 4 shows the commonly used feature selection methods according to three categories: single, serial, and parallel feature selection methods. The "Single Method" category comprises studies that use only one feature selection method. The "Serial Method" and "Parallel Method" categories refer to studies that use multiple feature selection methods, but with different approaches. "Serial Method" studies apply feature selection methods in sequence, each step further reducing the dimensionality of the radiomic features. "Parallel Method" studies test multiple single methods independently in search of better results. A feature selection method that was used twice or fewer is categorized as "Other."
The least absolute shrinkage and selection operator (LASSO) method is the most widely used in the "Single Method" and "Serial Method" categories. It is likely a popular choice because of its strong covariate reduction capabilities. The intraclass correlation coefficient (ICC) and concordance correlation coefficient (CCC) are also widely used, most often as a first step in a "Serial Method" study. ICC and CCC assess the reproducibility of features and serve as a filtering step before additional feature selection methods are employed. In the "Parallel Method" category, as many as 13 feature selection methods were employed in a single study. 59 In many cases, the feature selection methods chosen in "Parallel Method" studies were not widely used, resulting in the high frequency of "Other" feature selection methods in these studies.
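The covariate reduction behavior of LASSO can be illustrated with a short sketch. It assumes a continuous outcome and uses plain cyclic coordinate descent in NumPy; the function names, the penalty value `alpha`, and the iteration count are hypothetical choices for illustration and do not reproduce the implementation used in any reviewed study.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; values inside the band become 0."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            if col_sq[j] == 0:
                continue
            # Residual with feature j's current contribution removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n_samples
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

def select_features_lasso(X, y, alpha):
    """Standardize features, fit LASSO, and return indices with nonzero weights."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Xs = (X - X.mean(axis=0)) / std
    w = lasso_coordinate_descent(Xs, y - y.mean(), alpha)
    return np.flatnonzero(w != 0), w
```

The L1 penalty sets the weights of weakly informative features exactly to zero, which is why the method doubles as a feature selector. In a "Serial Method" study, the input matrix `X` would already have been reduced by an earlier step such as an ICC/CCC reproducibility filter.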

Feature selection in classification
The feature selection methods used in classification studies were tallied and separated according to the year of publication. This helps to identify potential trends that could suggest the efficacy of a particular technique. Figure 4 shows that the feature selection method used most often in classification studies is LASSO. LASSO was first used in CT lung cancer radiomic studies in 2018 and has been the most commonly used method from 2019 onwards. Maximum relevance-minimum redundancy (mRMR) was first used in 2017, though it was initially proposed in 2005. 153 The mRMR feature selection method is most often used in a serial manner, that is, mRMR is applied first and LASSO is then applied to the mRMR results. Among the commonly used feature selection methods in the classification studies, LASSO and ICC/CCC have clearly gained more attention over the past several years.
Figure 5 shows the distribution of feature selection methods in prognostic studies, which is very similar to that of classification studies. This suggests that the approach to both classification and prognostic studies is quite similar at the feature selection stage. The feature selection method that has become most commonly used in both classification and prognostic studies is LASSO.
TABLE 3 Commonly used feature selection methods by the studies included in this review (columns: Feature selection method, Mechanism)
The breakdown of the most common feature selection methods in "Single Method," "Serial Method," and "Parallel Method" studies is displayed. "Other" indicates feature selection methods that were used twice or less.

Common features for lung cancer
While there are hundreds of commonly extracted radiomic features and many more niche features, the majority are removed during feature selection. According to the reviewed studies, few features are common across CT lung radiomic studies. In classification studies, nine features (mean, median, skewness, cluster shade, cluster prominence, entropy, correlation, imc1, imc2) were selected more than 10 times in 79 total studies. In prognostic studies, three features (cluster shade, cluster prominence, skewness) were selected more than 10 times in 51 total studies. The wide range of applications within classification and prognostics may diminish the appearance of common features. [...] even they are likely redundant, for different potential applications.
FIGURE 5 Feature selection method distribution by year for prognostic studies. The methods most often used were LASSO and correlation. The "Other" category encompasses feature selection methods that were used two or fewer times in the review
Table 6 shows the statistics for the number of features selected for predictive model training across the reviewed studies. The average number of features is <10, though some studies have selected as many as 50, and one study found that semantic features alone performed better in predictive model training. 48 Two studies were excluded from this analysis because both used a very large number of radiomic features to train predictive models, and these outliers would skew the statistics. The outliers were a classification study that selected 115 features 74 and a prognostic study that selected 149 features. 136

Limited feature agreement
As seen in Table 5B, there is limited agreement in features in both classification and prognostic studies, with cluster shade, cluster prominence, and skewness being the features with the highest occurrence out of hundreds of individual features. The wide variety of applications in radiomics may contribute to the limited number of common features found in this review, though additional work should be conducted to determine whether the common features have application-related dependencies or whether there are no strong associations.
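Cluster shade and cluster prominence are both statistics of the gray-level co-occurrence matrix (GLCM). The sketch below, which assumes a 2D image already discretized to integer gray levels and a single pixel offset, shows the commonly used definitions: the third and fourth moments of (i + j - mu_x - mu_y) weighted by the normalized GLCM. The function names, the single-offset simplification, and the 2D input are illustrative assumptions; extraction software typically aggregates over several offsets in 3D.

```python
import numpy as np

def glcm(image, levels, offset=(0, 1)):
    """Symmetric, normalized GLCM of an integer-valued 2D image for one (row, col) offset."""
    d_row, d_col = offset
    n_rows, n_cols = image.shape
    P = np.zeros((levels, levels))
    for r in range(n_rows):
        for c in range(n_cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < n_rows and 0 <= c2 < n_cols:
                P[image[r, c], image[r2, c2]] += 1
    P = P + P.T  # count each pair in both directions
    return P / P.sum()

def cluster_features(P):
    """Cluster shade (3rd moment) and cluster prominence (4th moment) of a GLCM."""
    i, j = np.indices(P.shape)
    mu_x = np.sum(i * P)
    mu_y = np.sum(j * P)
    deviation = i + j - mu_x - mu_y
    return {
        "cluster_shade": float(np.sum(deviation ** 3 * P)),
        "cluster_prominence": float(np.sum(deviation ** 4 * P)),
    }
```

Cluster shade measures the asymmetry of the co-occurrence distribution and cluster prominence its peakedness, so the pair is to the GLCM roughly what skewness and kurtosis are to the intensity histogram.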

Correlation cutoff values
ICC/CCC and Pearson/Spearman correlation are heavily used in radiomic studies, often to provide an initial filtering of radiomic features based on reproducibility and redundancy, respectively. Although this review finds that these methods are relatively widespread, it is unclear why one, both, or neither method is chosen in any given study. The methods address two major concerns about radiomic feature values [157-161] but are not as ubiquitous as one might [...] consistent correlation thresholds and as such may be worth investigating in future work.
TABLE 5 (A) Three filters, wavelet, LoG, and original, that are most commonly applied to feature families to generate high-order radiomic features, based on features selected by the feature selection methods in the reviewed studies. A total count indicates the overall use of the filter among selected features. (B) Individual radiomic features and their respective number of appearances for each of the three most commonly used feature families: GLCM, first-order, and GLSZM. A total count indicates the overall use of the feature family among selected features (panels: Classification, Prognostics)
Typically, a radiomics study selects fewer than 10 features despite the fact that many studies extract hundreds of radiomic features before feature selection.
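The redundancy filtering discussed in this section can be sketched as a greedy pass over the pairwise Pearson correlation matrix. The cutoff of 0.9 below is a hypothetical value chosen only for illustration, and the function name and greedy keep-first rule are likewise assumptions rather than a procedure taken from the reviewed studies.

```python
import numpy as np

def drop_correlated_features(X, threshold=0.9):
    """Greedy redundancy filter on a (samples x features) matrix.

    A feature is kept only if its absolute Pearson correlation with every
    previously kept feature is at most `threshold` (hypothetical cutoff).
    Assumes no feature column is constant.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return kept
```

The result depends both on the cutoff and on the order in which features are visited, which is one reason why differing thresholds across studies make selected feature sets hard to compare.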

Feature harmonization
Harmonization of radiomic feature values is a technique that is not consistently used in radiomic studies. The performance of feature selection methods and predictive models is directly affected when feature values are adjusted. 162 It is also important to consider that the effects of harmonization are likely feature-dependent, since features produce values that range across multiple orders of magnitude. Future efforts should be made to fully investigate the need and benefits of this practice.
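As a simple illustration of how an adjustment changes feature values, the sketch below standardizes each feature within each acquisition batch (e.g., scanner or site). This per-batch z-score is a deliberately simplified stand-in chosen for this example; it is not the harmonization procedure of any reviewed study, and the function name and batch labels are hypothetical.

```python
import numpy as np

def zscore_by_batch(X, batch):
    """Standardize each feature column to zero mean and unit variance within each batch."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    out = np.empty_like(X)
    for label in np.unique(batch):
        rows = batch == label
        mean = X[rows].mean(axis=0)
        std = X[rows].std(axis=0)
        std[std == 0] = 1.0  # leave constant features centered but unscaled
        out[rows] = (X[rows] - mean) / std
    return out
```

After this adjustment, features that originally spanned several orders of magnitude share a common scale, which changes the input seen by penalized selection methods and by the downstream model. It also removes any genuine biological difference between batches, so it should be read as a sketch of the mechanism and not as a recommended practice.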

General predictive models
Typically, predictive models employ semantic data (e.g., malignancy status, smoking status, age, survival time, etc.), radiomic features, or a combination of the two. Many studies use multiple models with different combinations of semantic and radiomic data. After the data are input into the model for training, the machine learning process optimizes the parameters to produce the best fit to the data. Table 7 provides an overview of the predictive models used in the reviewed studies. The studies are separated based on the number of models tested, either single or multiple models. Across all reviewed studies, regression is the most commonly used predictive model in CT lung cancer radiomic studies, with logistic regression accounting for the vast majority. Other commonly used models after regression are support vector machine (SVM), random forest (RF), and the Cox proportional hazards model. Very often a study will use multiple modeling methods to test which method produces better results. This process has been made relatively easy by available software, which can provide multiple options for modeling without much increase in effort.
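The parameter optimization step can be illustrated with logistic regression, the most common model in the reviewed studies. The sketch below fits the weights by batch gradient descent on the log-loss; the learning rate, iteration count, and function names are hypothetical choices for illustration, and published studies generally rely on established statistical software instead.

```python
import numpy as np

def _with_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.column_stack([np.ones(len(X)), X])

def fit_logistic_regression(X, y, learning_rate=0.1, n_iter=2000):
    """Fit weights by gradient descent on the mean log-loss (y must be 0/1)."""
    X1 = _with_intercept(X)
    w = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ w)))
        w -= learning_rate * X1.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Predicted probability of the positive class for each row of X."""
    return 1.0 / (1.0 + np.exp(-(_with_intercept(X) @ w)))
```

The columns of `X` may hold selected radiomic features, semantic variables, or both, which mirrors the different input combinations described above.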

Commonly used predictive models in CT lung cancer radiomics
When looking at classification studies, classifiers are unsurprisingly the model of choice since the goal of the predictive model is to classify inputs into distinct groups. Figure 6 shows the distribution of models used in CT lung cancer radiomic studies published during the review period. The three most commonly used models for classification over time are regression, SVM, and RF, and the use of regression shows an increasing trend.
When looking at prognostic studies, as seen in Figure 7, studies still favor regression and RF as models. The primary difference from classification studies is the use of the Cox proportional hazards model, since it is suited to time-to-event outcomes such as survival. This reflects the general trend of using logistic regression to analyze large patient cohorts.

Cohort power
For radiomic studies, datasets can be sourced from public datasets (e.g., TCIA, RIDER, NLST, public clinical trials) or from institutional data collected for the purpose of the study. The decision to conduct a study using institutional data, public data, or a combination of both depends on the goals of the study and the method used to validate the predictive model. Table 8A shows the usage of institutional and public data in the reviewed studies. The number of datapoints available in a chosen cohort plays an important role in determining the statistical power of a study.

Training and validation
Once patient cohorts are determined and radiomic features are selected, predictive model training is the next step. Some studies generate a weighted radiomic signature prior to incorporating semantic features, which condenses the selected radiomic features into a single term. In these situations, concerns of double-training should be addressed, as the selected radiomic features are trained on the dataset to determine weighting prior to model training then used again as a radiomic signature during model training. Future considerations should be made to determine if a weighted radiomic signature has adverse effects on model performance compared to the direct use of the selected features.
Once model training is complete, best practice is to perform model validation followed by independent testing using an external dataset to assess model performance outside the training data. Some radiomic studies forgo validation and testing entirely and simply present the training results on their own. This methodology lacks both validation and independent testing, and its results should be treated with some skepticism. Other studies perform a validation step using a training and a validation dataset that both originate from the patient cohort, with varying ratios between the two datasets. This is often done as a multiple-fold cross-validation that can be used to estimate the fit of the model on new data, but no technique can replace testing with independent datasets. Table 8C characterizes how studies approach validation and testing and shows that 11 of 129 reviewed studies employ independent testing. Table 8D shows that there is no consistency in how patient cohorts are split, though the effects of split ratio on model performance in radiomics are unclear.
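The multiple-fold cross-validation mentioned above can be sketched as follows. The helper names, the default of five folds, and the `fit`/`score` callables are assumptions made for this example; any model and performance metric can be plugged in.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

def cross_validate(X, y, fit, score, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, and average the scores.

    `fit(X, y)` returns a model; `score(model, X, y)` returns a number.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k, seed):
        model = fit(X[train_idx], y[train_idx])
        scores.append(score(model, X[val_idx], y[val_idx]))
    return float(np.mean(scores)), scores
```

To keep the estimate honest, feature selection should be repeated inside `fit` for each fold so that the held-out fold never influences which features are chosen. Even then, every fold comes from the same cohort, so the averaged score estimates internal performance only and does not substitute for an independent test set.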
Once a model is trained, its performance is reported as results. Currently, there is no standard reporting method in radiomic studies for results when multiple feature selection methods and/or multiple models are used in a study. Some studies opt to report only the best-performing combination, while others display the results from all combinations used. The latter reporting style should be considered standard in future work, as a complete view of how specific methods interact and perform will help future studies choose better techniques.

CONCLUSION
Currently, radiomics and deep learning are the most researched techniques in medical imaging. Handcrafted radiomic analysis necessitates the use of machine learning or statistical algorithms after feature extraction in order to construct a predictive model. According to our review of existing publications, the performance of such analyses depends heavily on the feature selection methods and predictive models used. While efforts have been focused on the standardization of imaging biomarkers by addressing radiomic feature reproducibility and stability, the evaluation and validation of feature selection methods and predictive modeling to strengthen the stability and reproducibility of radiomic features remain a necessity. The future refinement of radiomics needs to investigate the limitations discussed above, including but not limited to feature extraction consistency, feature value harmonization, cohort power, and training/validation best practices. With improved robustness, radiomic analysis can become an efficient tool for aiding clinicians in risk stratification, prognostics, and patient management.

AUTHOR CONTRIBUTIONS
All listed authors contributed to the literature search and to drafting the manuscript.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest.